What is a regular expression

A regular expression is a group of letters and symbols used to find text that matches a specific pattern.

A regular expression is a pattern that is matched against a subject string from left to right. The term "regular expression" is a mouthful, so it is usually abbreviated to "regex" or "regexp". Regular expressions are used to replace text within a string, to validate forms, to extract substrings based on a pattern match, and so on.

Imagine you are writing an application and want to set rules for usernames: a username may contain lowercase letters, numbers, underscores and hyphens, and its length is limited so that the name does not look ugly. We use the following regular expression to validate a username:

[Figure: regexp-cn, the username validation pattern]

The regular expression above accepts john_doe, jo-hn_doe and john12_as. It does not match Jo, because that string contains an uppercase letter and is too short.
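
The pattern itself is not reproduced in the text, so the following Python sketch assumes one that is consistent with the description: lowercase letters, digits, underscores and hyphens, with a length of 3 to 15 characters. The exact length bounds and the names USERNAME_RE and is_valid_username are assumptions made for illustration.

```python
import re

# Assumed pattern: lowercase letters, digits, underscores and hyphens,
# 3 to 15 characters long. The length bounds are an assumption.
USERNAME_RE = re.compile(r"^[a-z0-9_-]{3,15}$")


def is_valid_username(name):
    """Return True when the whole string satisfies the assumed rule."""
    return USERNAME_RE.fullmatch(name) is not None


for candidate in ["john_doe", "jo-hn_doe", "john12_as", "Jo"]:
    print(candidate, is_valid_username(candidate))
```

Under this assumed pattern the first three names are accepted and Jo is rejected, which matches the behavior described above.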

1. Basic matching

A regular expression is just a pattern of characters used to search a text. For example, the regular expression the means: the letter t, followed by the letter h, followed by the letter e.

"the" => The fat cat sat on the mat.

The regular expression 123 matches the string 123. The pattern is compared with the input one character at a time, in order.

Regular expressions are normally case-sensitive, so The does not match the.

"The" => The fat cat sat on the mat.

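The two literal patterns above can be tried with Python's standard re module. This is a minimal sketch; the variable names are illustrative only.

```python
import re

sentence = "The fat cat sat on the mat."

# A literal pattern matches exactly those characters, in that order.
lowercase_hits = re.findall(r"the", sentence)

# Matching is case-sensitive by default, so "The" finds only the capitalized word.
capitalized_hits = re.findall(r"The", sentence)

print(lowercase_hits)    # ['the']
print(capitalized_hits)  # ['The']
```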

2. Metacharacters

Metacharacters are the building blocks of regular expressions. They do not stand for themselves; instead they are interpreted in a special way. Some metacharacters take on a different meaning when written inside square brackets. The metacharacters are as follows:

Metacharacter   Description
.               Matches any single character except a line break.
[ ]             Character class. Matches any character contained between the square brackets.
[^ ]            Negated character class. Matches any character not contained between the square brackets.
*               Matches 0 or more repetitions of the preceding symbol.
+               Matches 1 or more repetitions of the preceding symbol.
?               Makes the preceding symbol optional.
{n,m}           Matches at least n but not more than m repetitions of the preceding symbol.
(xyz)           Character group. Matches the characters xyz in that exact order.
|               Or operator. Matches either the characters before or the characters after the symbol.
\               Escape character. Used to match the reserved characters [ ] ( ) { } . * + ? ^ $ \ | literally.
^               Matches the beginning of the line.
$               Matches the end of the line.

2.1 The dot operator .

. is the simplest example of a metacharacter. It matches any single character except a line break. For example, the expression .ar matches any character, followed by a, followed by r.

".ar" => The car parked in the garage.

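As a quick check of the dot operator, the sketch below runs the .ar pattern with Python's re module and also shows that the dot skips line breaks by default.

```python
import re

sentence = "The car parked in the garage."

# Any single character followed by "ar".
dot_hits = re.findall(r".ar", sentence)
print(dot_hits)  # ['car', 'par', 'gar']

# By default the dot does not match a line break.
across_newline = re.findall(r"a.b", "a\nb")
print(across_newline)  # []
```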

2.2 Character sets

Character sets are also called character classes. Square brackets are used to specify a character set, and a hyphen inside the brackets specifies a range of characters. The order of the characters inside the brackets does not matter. For example, the expression [Tt]he matches both The and the.

"[Tt]he" => The car parked in the garage.

A period inside square brackets means a literal period. The expression ar[.] matches the letters ar followed by a period.

"ar[.]" => A garage is a good place to park a car.

2.2.1 Negating character sets

Generally ^ indicates the beginning of a string, but when it is the first character inside square brackets it negates the character set. For example, the expression [^c]ar matches any character other than c, followed by ar.

"[^c]ar" => The car parked in the garage.

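The three character-set patterns from this section can be compared side by side. The following is a small sketch using Python's re module; the variable names are illustrative.

```python
import re

# [Tt] accepts either an uppercase or a lowercase t.
either_case = re.findall(r"[Tt]he", "The car parked in the garage.")
print(either_case)  # ['The', 'the']

# Inside square brackets the period is a literal period.
literal_dot = re.findall(r"ar[.]", "A garage is a good place to park a car.")
print(literal_dot)  # ['ar.']

# [^c] accepts any character except c.
negated = re.findall(r"[^c]ar", "The car parked in the garage.")
print(negated)  # ['par', 'gar']
```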

2.3 Number of repetitions

The metacharacters +, * and ? specify how many times the preceding subpattern may occur. These metacharacters behave differently in different situations.

2.3.1 The * sign

The * sign matches zero or more repetitions of the preceding element. For example, the expression a* matches zero or more consecutive a characters; because zero repetitions are allowed, it can also match the empty string. When * follows a character set, it repeats the whole set. For example, the expression [a-z]* matches any run of lowercase letters in a line.

"[a-z]*" => The car parked in the garage #21.

The * sign can be combined with the metacharacter . to match any string of characters: .*. It can also be used with the whitespace shorthand \s to match a run of whitespace. For example, the expression \s*cat\s* matches zero or more whitespace characters, followed by cat, followed by zero or more whitespace characters.

"\s*cat\s*" => The fat cat sat on the concatenation.

2.3.2 The + sign

The + sign matches one or more repetitions of the preceding element. For example, the expression c.+t matches the letter c, followed by at least one character, followed by the letter t. Because + is greedy, the match extends to the last t in the sentence.

"c.+t" => The fat cat sat on the mat.

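The behavior of * and + can be checked with a short Python sketch. One detail is specific to this illustration: because [a-z]* may match the empty string, re.findall also reports empty matches, which are filtered out here.

```python
import re

# [a-z]* also matches the empty string, so empty results are dropped.
lowercase_runs = [m for m in re.findall(r"[a-z]*", "The car parked in the garage #21.") if m]
print(lowercase_runs)  # ['he', 'car', 'parked', 'in', 'the', 'garage']

# Optional whitespace on both sides of "cat".
padded_cat = re.findall(r"\s*cat\s*", "The fat cat sat on the concatenation.")
print(padded_cat)  # [' cat ', 'cat']

# + is greedy: the match runs from the first c to the last t.
greedy = re.findall(r"c.+t", "The fat cat sat on the mat.")
print(greedy)  # ['cat sat on the mat']
```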

2.3.3 The ? sign

In regular expressions the metacharacter ? makes the preceding element optional, i.e. it matches zero or one occurrence. For example, the expression [T]?he matches an optional uppercase T followed by h and e, so it matches both he and The.

"[T]he" => The car is parked in the garage.

"[T]?he" => The car is parked in the garage.

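A minimal Python sketch of the difference between the two patterns above: without ? the T is required, with ? it is optional.

```python
import re

sentence = "The car is parked in the garage."

# T is required: only the capitalized word matches.
required_t = re.findall(r"[T]he", sentence)
print(required_t)  # ['The']

# T is optional: "he" inside "the" matches as well.
optional_t = re.findall(r"[T]?he", sentence)
print(optional_t)  # ['The', 'he']
```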

2.4 The {} quantifier

In regular expressions {} is a quantifier that specifies how many times a character or group of characters can be repeated. For example, the expression [0-9]{2,3} matches at least 2 but not more than 3 digits in the range 0 to 9.

"[0-9]{2,3}" => The number was 9.9997 but we rounded it off to 10.0.

We can omit the second argument. For example, [0-9]{2,} matches at least two digits from 0 to 9.

If the comma is also omitted, the element is repeated a fixed number of times. For example, [0-9]{3} matches exactly 3 digits.

"[0-9]{2,}" => The number was 9.9997 but we rounded it off to 10.0.

"[0-9]{3}" => The number was 9.9997 but we rounded it off to 10.0.

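The three forms of the {} quantifier can be compared on the same sentence. This is a small sketch with Python's re module.

```python
import re

sentence = "The number was 9.9997 but we rounded it off to 10.0."

# Between 2 and 3 digits.
two_to_three = re.findall(r"[0-9]{2,3}", sentence)
print(two_to_three)  # ['999', '10']

# At least 2 digits, no upper bound.
two_or_more = re.findall(r"[0-9]{2,}", sentence)
print(two_or_more)  # ['9997', '10']

# Exactly 3 digits.
exactly_three = re.findall(r"[0-9]{3}", sentence)
print(exactly_three)  # ['999']
```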

2.5 (...) Capturing groups

A capturing group is a set of subpatterns written inside parentheses (...). As discussed earlier, a quantifier placed after a single character repeats that character. A quantifier placed after a group repeats the whole group. For example, the expression (ab)* matches 0 or more consecutive occurrences of ab.

We can also use the or character | inside () to express alternatives. For example, (c|g|p)ar matches car or gar or par.

"(c|g|p)ar" => The car is parked in the garage.

2.6 The | or operator

The | operator expresses alternation: it works like a logical OR between several alternatives.

For example, (T|t)he|car matches either (T|t)he or car.

"(T|t)he|car" => The car is parked in the garage.

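Groups and alternation can be tried together in Python. One caveat of Python's re.findall is that it returns the captured group rather than the whole match when the pattern contains a group, so this sketch uses a small helper, all_matches (a hypothetical name), built on re.finditer.

```python
import re


def all_matches(pattern, text):
    """Return the full text of every match (group 0), not just the captured group."""
    return [m.group(0) for m in re.finditer(pattern, text)]


sentence = "The car is parked in the garage."

group_hits = all_matches(r"(c|g|p)ar", sentence)
print(group_hits)  # ['car', 'par', 'gar']

either_hits = all_matches(r"(T|t)he|car", sentence)
print(either_hits)  # ['The', 'car', 'the']

# A quantifier after a group repeats the whole group.
repeated_group = re.fullmatch(r"(ab)*", "ababab") is not None
print(repeated_group)  # True

# For comparison: findall returns only what the group captured.
captured_only = re.findall(r"(c|g|p)ar", sentence)
print(captured_only)  # ['c', 'p', 'g']
```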

2.7 Escaping special characters

The backslash \ is used to escape the character that immediately follows it in an expression. This lets you match the reserved characters { } [ ] / \ + * . $ ^ | ? literally: to match one of them, precede it with a backslash \.

For example, . matches any character except a newline. To match a literal . in the input, write \. instead. The expression (f|c|m)at\.? matches f, c or m, followed by at, followed by an optional period.

"(f|c|m)at\.?" => The fat cat sat on the mat.

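The effect of escaping the period can be seen in the following Python sketch. It also uses re.escape, a standard-library helper that adds the backslashes automatically; the sample string "10.0 and 1000" is made up for illustration.

```python
import re

sentence = "The fat cat sat on the mat."

# \. matches a literal period, and ? makes that period optional.
escaped_hits = [m.group(0) for m in re.finditer(r"(f|c|m)at\.?", sentence)]
print(escaped_hits)  # ['fat', 'cat', 'mat.']

# Unescaped, the dot in "10.0" matches any character.
unescaped = re.findall("10.0", "10.0 and 1000")
print(unescaped)  # ['10.0', '1000']

# re.escape turns the text into a pattern that matches it literally.
literal = re.findall(re.escape("10.0"), "10.0 and 1000")
print(literal)  # ['10.0']
```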

2.8 Anchors

In regular expressions, anchors are used to check whether a match occurs at the beginning or at the end of the input. ^ anchors the match to the beginning, $ to the end.

2.8.1 The ^ sign

^ checks that the match starts at the beginning of the input string.

For example, applying the expression ^a to abc gives the result a. Applying ^b gives no result, because the string abc does not start with b.

Similarly, ^(T|t)he matches The or the only at the start of the input.

"(T|t)he" => The car is parked in the garage.

"^(T|t)he" => The car is parked in the garage.

2.8.2 The $ sign

Similar to the ^ sign, the $ sign checks that the match ends at the end of the input string.

For example, (at\.)$ matches at followed by a period, but only at the end of the input.

"(at\.)" => The fat cat. sat. on the mat.

"(at\.)$" => The fat cat. sat. on the mat.

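The anchored and unanchored patterns can be compared with a short Python sketch. The helper all_matches is a hypothetical name used only here; it collects whole matches via re.finditer.

```python
import re


def all_matches(pattern, text):
    """Return the full text of every match (group 0)."""
    return [m.group(0) for m in re.finditer(pattern, text)]


sentence = "The car is parked in the garage."
unanchored_the = all_matches(r"(T|t)he", sentence)
anchored_the = all_matches(r"^(T|t)he", sentence)
print(unanchored_the)  # ['The', 'the']
print(anchored_the)    # ['The']

dotted = "The fat cat. sat. on the mat."
unanchored_at = all_matches(r"(at\.)", dotted)
anchored_at = all_matches(r"(at\.)$", dotted)
print(unanchored_at)  # ['at.', 'at.', 'at.']
print(anchored_at)    # ['at.']

# ^b finds nothing in "abc" because the string does not start with b.
starts_with_b = re.search(r"^b", "abc")
print(starts_with_b)  # None
```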

3. Shorthand character sets

Regular expressions provide shorthands for some commonly used character sets. These are as follows:

Shorthand   Description
.           Matches any character except a newline.
\w          Matches alphanumeric characters, equivalent to [a-zA-Z0-9_]
\W          Matches non-alphanumeric characters, i.e. symbols, equivalent to [^\w]
\d          Matches digits, equivalent to [0-9]
\D          Matches non-digits, equivalent to [^\d]
\s          Matches whitespace characters, equivalent to [\t\n\f\r\p{Z}]
\S          Matches non-whitespace characters, equivalent to [^\s]
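
The shorthands can be tried in Python as sketched below. One Python-specific assumption: for text strings, \w and \d are Unicode-aware by default, so the re.ASCII flag is passed here to get the ASCII equivalences listed in the table. The short sample strings are made up for illustration.

```python
import re

sentence = "The number was 9.9997 but we rounded it off to 10.0."

digit_runs = re.findall(r"\d+", sentence, re.ASCII)
print(digit_runs)  # ['9', '9997', '10', '0']

word_runs = re.findall(r"\w+", "john_doe-42", re.ASCII)
print(word_runs)  # ['john_doe', '42']

symbols = re.findall(r"\W", "a-b c!", re.ASCII)
print(symbols)  # ['-', ' ', '!']

non_digit_runs = re.findall(r"\D+", "a1b22c", re.ASCII)
print(non_digit_runs)  # ['a', 'b', 'c']

non_space_runs = re.findall(r"\S+", "a b\tc\nd")
print(non_space_runs)  # ['a', 'b', 'c', 'd']
```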

4. Lookarounds (lookahead and lookbehind)

Lookahead and lookbehind (together called lookarounds) are specific kinds of non-capturing groups: they are used to test for a pattern without including it in the match. Lookarounds are used when a pattern must be preceded or followed by another pattern.

For example, to get all the numbers that are preceded by the $ character, we can use the positive lookbehind (?<=\$)[0-9.]*. This expression matches every run of digits and periods that is preceded by a $ character. The run may be empty, since * allows zero repetitions.

The lookarounds are as follows:

Symbol   Description
?=       Positive lookahead
?!       Negative lookahead
?<=      Positive lookbehind
?<!      Negative lookbehind

4.1 ?=... Positive lookahead

A positive lookahead asserts that the first part of the expression must be followed by the lookahead expression.

The returned match contains only the text matched by the first part of the expression. To define a positive lookahead, use parentheses with a question mark and an equal sign inside them: (?=...).

The lookahead expression is written after the equal sign inside the parentheses. For example, the expression [Tt]he(?=\sfat) matches The or the only when it is immediately followed by a whitespace character and fat.

"[Tt]he(?=\sfat)" => The fat cat sat on the mat.

4.2 ?!... Negative lookahead

A negative lookahead is used to get all matches that are not followed by a given pattern. It is defined the same way as a positive lookahead, except that = is replaced by !, giving (?!...).

The expression [Tt]he(?!\sfat) matches The or the only when it is not followed by a whitespace character and fat.

"[Tt]he(?!\sfat)" => The fat cat sat on the mat.

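Positive and negative lookahead can be compared on the same sentence. The sketch below uses Python's re module and writes the first part of the pattern as the character set [Tt].

```python
import re

sentence = "The fat cat sat on the mat."

# Matches The/the only when followed by whitespace and "fat".
followed_by_fat = re.findall(r"[Tt]he(?=\sfat)", sentence)
print(followed_by_fat)  # ['The']

# Matches The/the only when NOT followed by whitespace and "fat".
not_followed_by_fat = re.findall(r"[Tt]he(?!\sfat)", sentence)
print(not_followed_by_fat)  # ['the']

# The lookahead text is checked but is not part of the match.
first = re.search(r"[Tt]he(?=\sfat)", sentence)
print(first.group(0), first.end())  # The 3
```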

4.3 ?<=... Positive lookbehind

A positive lookbehind, written (?<=...), is used to get all matches that are preceded by a given pattern. For example, the expression (?<=[Tt]he\s)(fat|mat) matches fat or mat only when it is preceded by The or the and a whitespace character.

"(?<=[Tt]he\s)(fat|mat)" => The fat cat sat on the mat.

4.4 ?<!... Negative lookbehind

A negative lookbehind, written (?<!...), is used to get all matches that are not preceded by a given pattern. For example, the expression (?<![Tt]he\s)(cat) matches cat only when it is not preceded by The or the and a whitespace character.

"(?<![Tt]he\s)(cat)" => The cat sat on cat.

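The lookbehind patterns can be tried in Python as sketched below. Python's re module requires the pattern inside a lookbehind to have a fixed width, which holds for the patterns used here. The price string is a made-up sample.

```python
import re

# Digits and periods that directly follow a dollar sign.
prices = re.findall(r"(?<=\$)[0-9.]*", "Costs: $4.44 and $10.88")
print(prices)  # ['4.44', '10.88']

sentence = "The fat cat sat on the mat."
after_the = [m.group(0) for m in re.finditer(r"(?<=[Tt]he\s)(fat|mat)", sentence)]
print(after_the)  # ['fat', 'mat']

# Only the second "cat" is not preceded by "The ".
cat_positions = [m.start() for m in re.finditer(r"(?<![Tt]he\s)(cat)", "The cat sat on cat.")]
print(cat_positions)  # [15]
```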

5. Flags

Flags are also called modifiers, because they modify how a regular expression is matched. These flags can be used in any order or combination, and are an integral part of the regular expression.

Flag   Description
i      Case insensitive: ignores case when matching.
g      Global search: searches for all matches in the input.
m      Multi-line: the anchor metacharacters ^ and $ work at the beginning and end of each line.

5.1 Case insensitive

The modifier i makes matching case-insensitive. For example, the expression /The/gi means: the letter T, followed by h, followed by e, where the i flag tells the engine to ignore case and the g flag requests a global search of the whole input. It therefore matches both The and the.

"The" => The fat cat sat on the mat.

"/The/gi" => The fat cat sat on the mat.

5.2 Global search

The modifier g performs a global search, that is, it returns all matches rather than only the first one. For example, the expression /.(at)/g means: any character except a newline, followed by a, followed by t, returning every match in the input.

"/.(at)/" => The fat cat sat on the mat.

"/.(at)/g" => The fat cat sat on the mat.

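The /pattern/flags notation used above is not Python syntax, so the following sketch maps the flags onto Python's re module as an approximation: re.IGNORECASE plays the role of i, and the choice between re.search (first match only) and re.finditer or re.findall (all matches) plays the role of g.

```python
import re

sentence = "The fat cat sat on the mat."

# Without the i flag only the capitalized word matches.
case_sensitive = re.findall(r"The", sentence)
print(case_sensitive)  # ['The']

# re.IGNORECASE corresponds to the i flag.
case_insensitive = re.findall(r"The", sentence, re.IGNORECASE)
print(case_insensitive)  # ['The', 'the']

# Without g: only the first match, here via re.search.
first_only = re.search(r".(at)", sentence).group(0)
print(first_only)  # fat

# With g: every match, here via re.finditer.
every_match = [m.group(0) for m in re.finditer(r".(at)", sentence)]
print(every_match)  # ['fat', 'cat', 'sat', 'mat']
```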

5.3 Multiline

The multiline modifier m is used to perform a multi-line match.

As described earlier, the anchors ^ and $ check whether the pattern is at the beginning or the end of the input string. To make them work at the beginning and end of each line instead, we need the multi-line modifier m.

For example, the expression /.at(.)?$/gm means: any character except a newline, followed by a, followed by t, optionally followed by one more character (again not a newline), at the end of a line. Because of the m flag the end of every line is tested, and because of the g flag all matches are returned. The difference is only visible when the input spans several lines.

"/.at(.)?$/" => The fat cat sat on the mat.

"/.at(.)?$/gm" => The fat cat sat on the mat.

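The effect of the m flag only shows up on input that contains line breaks. The sketch below therefore assumes a three-line version of the sentence (the position of the line breaks is an assumption made for illustration) and uses Python's re.MULTILINE, which corresponds to m.

```python
import re

# Hypothetical three-line input; the line breaks are assumed for illustration.
text = "The fat\ncat sat\non the mat."

# Without MULTILINE, $ matches only at the very end of the input.
end_of_input = [m.group(0) for m in re.finditer(r".at(.)?$", text)]
print(end_of_input)  # ['mat.']

# With MULTILINE, $ also matches at the end of every line.
end_of_each_line = [m.group(0) for m in re.finditer(r".at(.)?$", text, re.MULTILINE)]
print(end_of_each_line)  # ['fat', 'sat', 'mat.']
```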

Contributions

Thanks to the Learn-regex project.

License

MIT © Zeeshan Ahmad